Lesson 4


Scatterplots and Perceived Audience Size

Notes:


Scatterplots

Notes:

library(ggplot2)
pf <- read.csv("../lesson3/pseudo_facebook.tsv", sep = "\t")
qplot(age, friend_count, data = pf)


What are some things that you notice right away?

Response: Most people have a small number of friends on Facebook. Some young people have a large number of friends.


ggplot Syntax

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
    geom_point() +
    xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).


Overplotting

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
    geom_jitter(alpha = 1 / 20) +
    xlim(13, 90)
## Warning: Removed 5154 rows containing missing values (geom_point).

What do you notice in the plot?

Response: Most users have fewer than 500 friends.
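
A rough sketch of why alpha = 1 / 20 reveals density. It assumes the graphics device blends overlapping points with standard alpha compositing, which the lesson does not state. Under that assumption, a stack of n points has combined opacity 1 - (1 - alpha)^n. The stack sizes are arbitrary illustration values.

```r
# Assumption: standard "over" alpha compositing of identical points.
alpha <- 1 / 20

# Hypothetical numbers of points drawn on top of each other.
n_points <- c(1, 5, 20, 50, 100)

# Combined opacity of each stack: grows with n but never quite reaches 1.
stack_opacity <- 1 - (1 - alpha)^n_points

print(data.frame(n_points, stack_opacity = round(stack_opacity, 3)))
```

Sparse regions stay faint while dense regions approach solid black, which is what makes the bulk of the data visible in the jittered plot.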


Coord_trans()

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
    geom_point(alpha = 1 / 20, position = position_jitter(h = 0)) +
    xlim(13, 90) +
    coord_trans(y = "sqrt")
## Warning: Removed 5173 rows containing missing values (geom_point).

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

ggplot(aes(x = age, y = friend_count), data = pf) +
    geom_point(alpha = 1 / 20) +
    xlim(13, 90) +
    coord_trans(y = "sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).

What do you notice?

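The dense black bands of points at low friend counts appear taller, because the square-root scale stretches the lower part of the y-axis.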
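
A small sketch of what the square-root transformation does to axis positions. The friend counts are made-up perfect squares, chosen so the arithmetic is easy to follow.

```r
# Hypothetical friend counts (perfect squares, for readability).
friend_counts <- c(0, 100, 400, 900, 1600, 2500)

# Position of each value along a square-root-transformed axis.
axis_position <- sqrt(friend_counts)

print(data.frame(friend_counts, axis_position))

# Equal steps along the transformed axis cover ever larger
# ranges of friend counts, so low counts get more room.
print(diff(friend_counts))
print(diff(axis_position))
```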


Alpha and Jitter

Notes:

ggplot(aes(x = age, y = friendships_initiated), data = pf) +
    geom_point(alpha = 1 / 10, position = position_jitter(h = 0)) +
    coord_trans(y = "sqrt")
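
One likely reason for using position_jitter(h = 0) together with coord_trans(y = "sqrt"): vertical jitter can push a count of 0 below zero, and the square root of a negative number is NaN. The sketch below uses a hand-written noise vector as a stand-in for random jitter, so the values are illustrative only.

```r
# Hypothetical counts, including users who initiated no friendships.
friendships_initiated <- c(0, 0, 1, 2)

# Stand-in for vertical jitter (fixed values instead of random noise).
vertical_noise <- c(-0.3, 0.2, -0.1, 0.3)
jittered <- friendships_initiated + vertical_noise

# sqrt() of a negative value is NaN (R also emits a warning).
transformed <- suppressWarnings(sqrt(jittered))
print(transformed)

# With no vertical jitter, every value can be transformed.
print(sqrt(friendships_initiated))
```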


Overplotting and Domain Knowledge

Notes:


Conditional Means

Notes:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
pf.fc_by_age <- pf %>%
    group_by(age) %>%
    summarise(friend_count_mean = mean(friend_count),
              friend_count_median = median(friend_count),
              n = n())
head(pf.fc_by_age)
## # A tibble: 6 x 4
##     age friend_count_mean friend_count_median     n
##   <int>             <dbl>               <dbl> <int>
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196

Create your plot!

ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
    geom_line()
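
For intuition, the grouped summary computed with group_by() and summarise() can be reproduced on a tiny data frame with base R. The toy data and the names toy and toy_by_age are invented for this sketch and are not part of the lesson's dataset.

```r
# Hypothetical toy data: two ages, five users.
toy <- data.frame(age          = c(13, 13, 14, 14, 14),
                  friend_count = c(10, 30,  5, 15, 100))

# One summary value per distinct age.
fc_mean   <- tapply(toy$friend_count, toy$age, mean)
fc_median <- tapply(toy$friend_count, toy$age, median)
fc_n      <- tapply(toy$friend_count, toy$age, length)

# Assemble a table shaped like pf.fc_by_age.
toy_by_age <- data.frame(age                 = as.integer(names(fc_mean)),
                         friend_count_mean   = as.vector(fc_mean),
                         friend_count_median = as.vector(fc_median),
                         n                   = as.vector(fc_n))
print(toy_by_age)
```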


Overlaying Summaries with Raw Data

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
    geom_point(alpha = 1 / 20, position = position_jitter(h = 0), 
               color = "orange") +
    coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
    geom_line(stat = "summary", fun.y = mean) +
    geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.1),
              linetype = 2, color = "blue") +
    geom_line(stat = "summary", fun.y = quantile, fun.args = list(probs = 0.9),
              linetype = 2, color = "blue") +
    geom_line(stat = "summary", fun.y = median, color = "blue")

What are some of your observations of the plot?

Response: All lines except the 10% quantile have similar trends. The median is smaller than the mean.
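
A minimal sketch of the summary functions used in the overlay, applied to one invented right-skewed sample. It illustrates why the median can sit below the mean. The vector skewed is hypothetical.

```r
# Hypothetical right-skewed sample: one very large value.
skewed <- c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 1000)

# The same quantile() call the plot uses, with both probabilities at once.
q <- quantile(skewed, probs = c(0.1, 0.9))
print(q)

# The single large value pulls the mean up but not the median.
sample_mean   <- mean(skewed)
sample_median <- median(skewed)
print(c(mean = sample_mean, median = sample_median))
```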


Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:


Correlation

Notes:

cor.test(pf$friend_count, pf$age)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$friend_count and pf$age
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
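
Rather than reading the coefficient off the printed output, it can be pulled out of the object that cor.test() returns. This is a sketch on invented vectors x and y, not on the pf data.

```r
# Hypothetical data: y mostly increases with x.
x <- 1:10
y <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)

# cor.test() returns a list; the coefficient is in $estimate.
result <- cor.test(x, y)
r <- unname(result$estimate)

# Round to three decimal places, as the question asks.
print(round(r, 3))

# The confidence interval and p-value are stored in the same object.
print(result$conf.int)
print(result$p.value)
```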


Correlation on Subsets

Notes:

with(subset(pf, age <= 70), cor.test(age, friend_count))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

Correlation Methods

Notes:
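
cor.test() uses Pearson's method by default, and its method argument also accepts "spearman" and "kendall". The sketch below compares Pearson and Spearman on an invented relationship that is monotonic but not linear. The vectors are illustrative only.

```r
# Hypothetical data: y always increases with x, but not linearly.
x <- 1:10
y <- x^3

# Pearson measures linear association.
pearson <- unname(cor.test(x, y, method = "pearson")$estimate)

# Spearman measures monotonic association (it works on ranks).
spearman <- unname(cor.test(x, y, method = "spearman")$estimate)

print(c(pearson = pearson, spearman = spearman))
```

Because the ranks of x and y agree perfectly, the Spearman coefficient is 1 while the Pearson coefficient is lower.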


Create Scatterplots

Notes:

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
    geom_point() +
    coord_cartesian(xlim = c(0, 4e4), ylim = c(0, 1e5))


Strong Correlations

Notes:

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
    geom_point() +
    xlim(0, quantile(pf$www_likes_received, 0.95)) +
    ylim(0, quantile(pf$likes_received, 0.95)) +
    geom_smooth(method = "lm", color = "red")
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

with(pf, cor.test(www_likes_received, likes_received))
## 
##  Pearson's product-moment correlation
## 
## data:  www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response: 0.948


Moira on Correlation

Notes:


More Caution with Correlation

Notes:

library(alr3)
## Loading required package: car
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
data("Mitchell")

Create your plot!

ggplot(aes(x = Month, y = Temp), data = Mitchell) +
    geom_point()


Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot. 0

  2. What is the actual correlation of the two variables? 0.057 (Round to the thousandths place)

with(Mitchell, cor.test(Month, Temp))
## 
##  Pearson's product-moment correlation
## 
## data:  Month and Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Making Sense of Data

Notes:

ggplot(aes(x = Month, y = Temp), data = Mitchell) +
    geom_point() +
    scale_x_continuous(breaks = seq(0, 203, 12))


A New Perspective

What do you notice? Response: The temperature follows a yearly cycle.
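
A sketch of how a strong pattern can coexist with a correlation near zero. It uses a synthetic sine wave as a stand-in for the Mitchell temperatures. The amplitude and the variable names are assumptions made for illustration.

```r
# Synthetic stand-in for the Mitchell data: 204 months, yearly cycle.
month <- 0:203
temp  <- 10 * sin(2 * pi * month / 12)

# Linear correlation is close to zero despite the obvious pattern.
r <- cor(month, temp)
print(round(r, 3))

# Folding months onto a single year exposes the cycle.
month_of_year <- month %% 12
seasonal_mean <- tapply(temp, month_of_year, mean)
print(round(seasonal_mean, 2))
```

On the real data, the analogous step would be to plot Temp against Month %% 12.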

Watch the solution video and check out the Instructor Notes!

Notes:


Understanding Noise: Age to Age Months

Notes:

pf$age_with_months <- with(pf, age + 1 - dob_month / 12)
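
A worked example of the formula for a few birth months. The formula reads as if ages are measured at the end of the year, which is an interpretation rather than something stated here. The birth months chosen are arbitrary.

```r
# A 13-year-old with different (hypothetical) birth months.
age       <- 13
dob_month <- c(1, 6, 10, 12)

# Same formula as above: earlier birth months add a larger fraction.
age_with_months <- age + 1 - dob_month / 12

print(data.frame(dob_month, age_with_months))
```

A birth month of 10 gives 13 + 1 - 10/12, which matches the smallest value of 13.16667 in the summary table that follows.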

Age with Months Means

pf.fc_by_age_months <- pf %>% 
    group_by(age_with_months) %>%
    summarise(friend_count_mean = mean(friend_count),
              friend_count_median = median(friend_count),
              n = n()) %>%
    arrange(age_with_months)
head(pf.fc_by_age_months)
## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54

Programming Assignment

age_with_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months2 <- summarise(age_with_months_groups,
                                  friend_count_mean = mean(friend_count),
                                  friend_count_median = median(friend_count),
                                  n = n())
pf.fc_by_age_months2 <- arrange(pf.fc_by_age_months2, age_with_months)
head(pf.fc_by_age_months2)
## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54

Noise in Conditional Means

ggplot(aes(x = age_with_months, y = friend_count_mean), 
       data = subset(pf.fc_by_age_months, age_with_months < 71)) +
    geom_line()


Smoothing Conditional Means

Notes:

p1 <- ggplot(aes(x = age, y = friend_count_mean), 
       data = subset(pf.fc_by_age, age < 71)) + 
    geom_line() + 
    geom_smooth()
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), 
             data = subset(pf.fc_by_age_months, age_with_months < 71)) +
    geom_line() +
    geom_smooth()
p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count),
             data = subset(pf, age < 71)) +
    geom_line(stat = "summary", fun.y = mean)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p1, p2, p3)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'


Which Plot to Choose?

Notes:


Analyzing Two Variables

Reflection: For data with discrete values, it’s useful to use jitter and transparency for visualization.

